library(rvest)
library(tidyverse)
library(knitr)
library(plyr)
library(dplyr)
library(jsonlite)
library(lubridate)
library(RSelenium)

Scrape all departures or arrivals from FlightRadar website

We look at two methods to capture departure and arrival data for an airport on the FlightRadar website: driving a headless browser, and calling the site's XHR requests directly.

For each airport page, the FlightRadar website shows general information as well as departing and arriving flights. In this tutorial we scrape the departures page and the arrivals page of Bordeaux Mérignac Airport (BOD).

If you go to the departures page, you will see two interesting buttons, one at the top of the page and one at the bottom.

To display all the available data (roughly 24 hours of past and future departures/arrivals), we simulate repeated clicks on these two buttons, and we stop only when the buttons disappear from the page.

Using Selenium headless browser

Because of the defences some webmasters put in place to protect their data, you need to simulate human behavior, if possible using a real browser.

In short, Selenium is a suite of tools for automating browsers, mainly used to test web applications. It works with many web browsers and many operating systems.

Selenium WebDriver gives developers an API to drive a web browser, including a headless one that never opens a window. You can use this API from your favorite language (Java, Python, R, etc.) to send commands to the browser: navigate, move the mouse, click on DOM elements, send keystrokes to input forms, inject JavaScript, capture a screenshot of the page, extract HTML, etc.

First, you need to install and load the RSelenium package, the R bindings for the Selenium WebDriver API:

install.packages("devtools")
devtools::install_github("ropensci/RSelenium")

Depending on your existing configuration and OS, you may need to install some additional system dependencies.

It is possible to use Selenium directly with your local browser, but we prefer the server version. Why? Because with a Selenium server you can a) send commands to a local or remote server running Selenium, b) which may run different browsers and/or operating systems, and c) distribute tests over multiple machines.

Selenium is a fast-moving project and some releases are quite buggy, so try to choose a stable version, and don’t despair.

Run a Selenium server

Install Docker on your OS by following the Docker installation notes at the end of this document.

Once that is done, we pull and run one of the Docker Selenium server images from a terminal. For this tutorial we use Firefox!

Under normal conditions (a good internet connection), we pull the image directly from Docker Hub.

sudo docker pull selenium/standalone-firefox:3.14.0-arsenic

But because the images are large (about 1 GB for the two images used in this tutorial), we prefer to load them directly from the USB key provided by your teachers. Open a terminal in the folder where the images are located.

sudo docker load --input=r-alpine.tar
sudo docker load --input=rSelenium.tar

Create the Selenium container :

sudo docker run --shm-size=2g --name selenium -d -p 4445:4444 selenium/standalone-firefox:3.14.0-arsenic

Type sudo docker ps to check that the server is running and that host port 4445 is mapped to it.

Connect to Selenium Server

Connect to the server and open a browser session on it.

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L)
remDr$open()
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "firefox"
## 
## $browserVersion
## [1] "61.0.1"
## 
## $`moz:accessibilityChecks`
## [1] FALSE
## 
## $`moz:headless`
## [1] FALSE
## 
## $`moz:processID`
## [1] 288
## 
## $`moz:profile`
## [1] "/tmp/rust_mozprofile.xqLsGR2Zk891"
## 
## $`moz:useNonSpecCompliantPointerOrigin`
## [1] FALSE
## 
## $`moz:webdriverClick`
## [1] TRUE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "linux"
## 
## $platformVersion
## [1] "4.15.0-34-generic"
## 
## $rotatable
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $webdriver.remote.sessionid
## [1] "2d803d97-f908-44c6-afee-e3ecb1d7268b"
## 
## $id
## [1] "2d803d97-f908-44c6-afee-e3ecb1d7268b"
remDr$maxWindowSize()

Basic commands for RSelenium

John Harrison, the creator and first committer of the RSelenium bindings for Selenium, wrote a long tutorial covering many commands : https://rpubs.com/johndharrison/RSelenium-Basics

Some of them :

  • remDr$maxWindowSize() : maximize the browser window.
  • remDr$navigate("https://www.google.fr") : navigate to a URL.
  • remDr$screenshot(display = TRUE) : take a screenshot of the web page and display it in the RStudio Viewer.
  • remDr$findElement(...) : find an element in the HTML structure using one of several methods : xpath, css, etc.
  • remDr$executeScript(...) : execute a JavaScript script in the remote browser.

Analyze the HTML page structure!

Open the Web Developer tools in your favorite browser on the BOD arrivals page : https://www.flightradar24.com/data/airports/bod/arrivals

We investigate what happens in the HTML code when the load earlier or load later button is clicked. Why do we do that? To understand how we can automate things.

We want to automate clicks on these two buttons, so we need to understand WHEN to stop clicking :) If we click indefinitely, an error will probably be triggered when one of the two buttons disappears.

Select the element selector tool and click on the load earlier flights button.

If you clicked the right thing, the part of the HTML code that interests us should now be highlighted :

Now, if you highlight and click the load later flights button with the selector tool, you get something like this :

The two button objects are not very different. It seems that only the timestamp, the data page number and the button text change…

Highlight and click the load earlier flights button once more. Click again to load a new page of data. You can see that the HTML code changes while the data loads, to disable clicks on the button. Not so interesting. Now keep clicking, and stop only when the button disappears from your screen.

Great, a new CSS style attribute appears, indicating that this button object is now hidden : style="display: none;"

How can we reuse this important information during data harvesting to detect whether the button is still active? A good solution is an XPath query!

Load the page in the selenium server

remDr$navigate("https://www.flightradar24.com/data/airports/bod/arrivals")
Sys.sleep(5) # time to load !
remDr$screenshot(file = "screenshoot.png")
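
A fixed Sys.sleep(5) may be too short on a slow connection and wasteful on a fast one. As a hedged alternative, the sketch below polls the page until an XPath expression matches at least one element. The helper name wait_for_elements() and its timeout and interval parameters are hypothetical, introduced only for this illustration; the only RSelenium call it relies on is findElements().

```r
# Hypothetical helper (not part of RSelenium): poll until `xpath` matches
# at least one element, or give up after `timeout` seconds.
wait_for_elements <- function(driver, xpath, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    found <- driver$findElements(using = "xpath", value = xpath)
    if (length(found) > 0) return(found)          # something matched: stop waiting
    if (Sys.time() >= deadline) return(list())    # timed out: return an empty list
    Sys.sleep(interval)
  }
}

# Possible usage with the session opened earlier (needs a running Selenium server):
# buttons <- wait_for_elements(remDr, "//button[@class='btn btn-table-action btn-flights-load']")
```

Returning an empty list on timeout, instead of raising an error, lets the caller decide whether a missing element is a problem.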

Building a correct XPath expression can be difficult. A good way to check the validity of your XPath expressions is to try them interactively in the web developer console.

Click on the Console tab :

Type this in the console : $x("//button[@class='btn btn-table-action btn-flights-load']")

The result is an interactive array that you can expand as a tree if you want.

Click, click, click until one of the load buttons disappears; we now try to select only the button that is still available. XPath understands boolean operators (or, and, not(), etc.), so we filter on @class and exclude the hidden style :

$x("//button[@class='btn btn-table-action btn-flights-load' and not(contains(@style,'display: none;'))]")

Great, this query returns only the button that is still visible. We will use it later to stop our infernal button-clicking loop.
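
The visible-button filter can also be checked offline in R, without a browser. The sketch below assumes the xml2 package (installed as a dependency of rvest) and a small hand-written HTML fragment that only mimics the two load buttons; the real page markup may differ.

```r
library(xml2)

# Made-up fragment for illustration: one hidden button, one visible button.
html <- read_html('
<div>
  <button class="btn btn-table-action btn-flights-load" style="display: none;">Load earlier flights</button>
  <button class="btn btn-table-action btn-flights-load">Load later flights</button>
</div>')

# Every load button, hidden or not.
all_buttons <- xml_find_all(
  html, "//button[@class='btn btn-table-action btn-flights-load']")

# Only the buttons whose style does not hide them.
visible <- xml_find_all(
  html,
  "//button[@class='btn btn-table-action btn-flights-load' and not(contains(@style,'display: none;'))]")

length(all_buttons)
xml_text(visible)
```

A button without any style attribute passes the filter, because contains() applied to a missing attribute is false.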

Now we build this query in RSelenium with the findElements() function :

loadmorebutton <- remDr$findElements(using = 'xpath', "//button[@class='btn btn-table-action btn-flights-load' and not(contains(@style,'display: none;'))]")

Display the text of each element retrieved by findElements() using the getElementText() function :

unlist(lapply(loadmorebutton, function(x){x$getElementText()}))
## [1] "Load earlier flights" "Load later flights"

Now, how do we simulate a click on one of these buttons?

An easy way is to call the clickElement() function on the first loadmorebutton web element :

tryCatch({
suppressMessages({
  loadmorebutton[[1]]$clickElement()})},
error = function(e) {
    loadmorebutton[[1]]$errorDetails()$message
  })

This command returns an error message (if not, you’re lucky!) that is not very explicit, so if you want more details you can call the errorDetails() function, as in our tryCatch block.

An element of the web page overlaps our button, so the browser tells us that it is not possible to click on this web element. Use the screenshot function to see the page :

remDr$screenshot(file = 'screenshoot_overlap.png' )

If we hide these elements using XPath and JavaScript injection, everything goes back to normal. First we accept cookies.

hideCookie <- function (x){
  cookiesButton <- x$findElement(using = 'xpath',"//div[@class='important-banner__close']") 
  cookiesButton$clickElement()
}

hideCookie(remDr)
remDr$screenshot(file = 'screenshoot_hide.png')
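
Note that findElement() raises an error when nothing matches, so hideCookie() fails if the banner has already been dismissed. A defensive variant could look like the sketch below; hideCookieIfPresent() is a hypothetical name, and the XPath is the one used in hideCookie().

```r
# Hypothetical variant of hideCookie(): click the banner only if it exists.
hideCookieIfPresent <- function(x) {
  banners <- x$findElements(using = "xpath",
                            value = "//div[@class='important-banner__close']")
  if (length(banners) == 0) {
    return(invisible(FALSE))     # no banner on the page, nothing to do
  }
  banners[[1]]$clickElement()    # close the first matching banner
  invisible(TRUE)
}

# Possible usage with the session opened earlier:
# hideCookieIfPresent(remDr)
```

findElements() returns an empty list instead of failing, which makes the presence check straightforward.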

The navbar element causes a problem, so we hide it using JavaScript injection :

hideNavBar <- function (x) {
  script <- "document.getElementById('navContainer').hidden = true;"
  x$executeScript(script)
}
hideNavBar(remDr)
## list()

Now you can call clickElement() without problems :)

tryCatch({
suppressMessages({
  loadmorebutton[[1]]$clickElement()})},
error = function(e) {
    remDr$errorDetails()$message
  })

See the changes before and after using the remDr$screenshot(display = TRUE) command.

Exercises

  • Create a function that clicks on the buttons until they all disappear :)
  • Extract the data for a single day using XPath and rvest!

Using XHR request

https://www.w3schools.com/js/js_json_http.asp

Sometimes a defence is also a point of vulnerability. Many sites use an internal API to query data and feed their pages.

Let's see if this is the case with FlightRadar :)

Open the dev tools in the browser, click on the Network tab, then on the XHR filter.

Lucky you, do you see it? Each GET query calls an airport.json file on the server :

https://api.flightradar24.com/common/v1/airport.json?code=bod&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]=1537297562&page=1&limit=100&token=

If we decompose the query, we have :

  • an airport code : bod
  • a timestamp : 1537297562
  • a page number : 1
  • a limit per page : 100
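
These four parameters can be assembled programmatically. The sketch below is one possible way to do it; build_airport_url() is a hypothetical helper name, and the URL layout is simply copied from the request observed in the Network tab, with no guarantee that the API keeps this form.

```r
# Hypothetical helper: rebuild the airport.json URL seen in the Network tab.
build_airport_url <- function(code, timestamp, page = 1, limit = 100) {
  paste0(
    "https://api.flightradar24.com/common/v1/airport.json",
    "?code=", code,
    "&plugin[]=&plugin-setting[schedule][mode]=",
    "&plugin-setting[schedule][timestamp]=", as.integer(timestamp),  # whole seconds
    "&page=", as.integer(page),
    "&limit=", as.integer(limit),
    "&token="
  )
}

# Reproduces the URL shown above for BOD.
build_airport_url("bod", 1537297562)
```

Changing only the page argument gives the URL of another page of results for the same timestamp.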

Copy and paste this URL into your browser to see how the resulting JSON is structured. The interesting data is located under result > response > airport > pluginData > schedule > arrivals :

  • item : total number of items
  • page : current page and number of pages
  • timestamp : date of capture
  • data : a list of up to 100 flights corresponding to the current page

We download the JSON data and convert it to a data.frame using the wonderful jsonlite package :) Why wonderful? Because jsonlite has an option to flatten the JSON structure, which would otherwise give a data.frame nested inside a data.frame inside a data.frame…
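
To see what flattening does without calling the API, here is a minimal sketch on a tiny hand-written JSON string. The two records are made up and only imitate the nesting of the real response.

```r
library(jsonlite)

# Made-up sample imitating the nesting flight > airline > code > icao.
sample_json <- '[
  {"flight": {"airline": {"name": "Volotea", "code": {"icao": "VOE"}}}},
  {"flight": {"airline": {"name": "Ryanair", "code": {"icao": "RYR"}}}}
]'

nested <- fromJSON(sample_json)                 # one column holding nested data.frames
flat   <- fromJSON(sample_json, flatten = TRUE) # one column per leaf value

names(nested)
names(flat)
flat$flight.airline.code.icao
```

With flatten = TRUE the nested levels become dot-separated column names, which is why the select() call further down uses names such as flight.airline.code.icao.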

# whole seconds since the epoch; as.integer() drops the fractional part
timestamp <- as.integer(as.POSIXct(now()))

url <- paste("https://api.flightradar24.com/common/v1/airport.json?code=bod&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]=",timestamp,"&page=1&limit=100&token=",sep="")

# https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html
json <- jsonlite::fromJSON(url, flatten = TRUE)
pageOfData <- json$result$response$airport$pluginData$schedule$arrivals$data
filteredData <- pageOfData %>% select(flight.airline.code.icao, flight.airline.name, flight.airport.origin.name, flight.airport.origin.code.icao, flight.airport.origin.position.latitude, flight.airport.origin.position.longitude)

# plyr::rename() expects c(old = "new"); the explicit prefix avoids picking up dplyr::rename()
filteredData <- plyr::rename(filteredData, c(flight.airline.code.icao = "ICAO", flight.airline.name = "Name", flight.airport.origin.name = "Origin", flight.airport.origin.code.icao = "Origin ICAO", flight.airport.origin.position.latitude = "Latitude", flight.airport.origin.position.longitude = "Longitude"))

knitr::kable(filteredData, caption = "page 1 of arrival for BOD")
page 1 of arrival for BOD

| ICAO | Name | Origin | Origin ICAO | Latitude | Longitude |
|:-----|:-----|:-------|:------------|---------:|----------:|
| VOE | Volotea | Dubrovnik Airport | LDDU | 42.56135 | 18.268240 |
| RYR | Ryanair | Milan Bergamo Il Caravaggio International Airport | LIME | 45.67388 | 9.704166 |
| AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
| IBE | Iberia Regional | Madrid Barajas Airport | LEMD | 40.49355 | -3.566760 |
| AFR | Air France | Bastia Poretta Airport | LFKB | 42.55000 | 9.484722 |
| DLH | Lufthansa | Frankfurt Airport | EDDF | 50.02642 | 8.543125 |
| BMS | Blue Air | Bucharest Henri Coanda International Airport | LROP | 44.57216 | 26.102171 |
| VOE | Volotea | Palma de Mallorca Airport | LEPA | 39.55167 | 2.738808 |
| RAM | Royal Air Maroc | Marrakesh Menara Airport | GMMX | 31.60688 | -8.036300 |
| BAW | British Airways | London Gatwick Airport | EGKK | 51.14805 | -0.190270 |
| EZY | easyJet | Bristol Airport | EGGD | 51.38266 | -2.719080 |
| RYR | Ryanair | Rome Ciampino Airport | LIRA | 41.79936 | 12.594930 |
| EZY | EasyJet | Geneva International Airport | LSGG | 46.23806 | 6.108950 |
| HOP | HOP! | Ajaccio Napoleon Bonaparte Airport | LFKJ | 41.92388 | 8.802500 |
| EZY | EasyJet | London Gatwick Airport | EGKK | 51.14805 | -0.190270 |
| EZY | easyJet | Tel Aviv Ben Gurion International Airport | LLBG | 32.01138 | 34.886662 |
| AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
| KLM | KLM | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
| EZY | EasyJet | Catania Fontanarossa Airport | LICC | 37.46678 | 15.066400 |
| HOP | HOP! | Figari Sud-Corse Airport | LFKF | 41.50222 | 9.096667 |
| EZY | EasyJet | Marrakesh Menara Airport | GMMX | 31.60688 | -8.036300 |
| VOE | Volotea | Bastia Poretta Airport | LFKB | 42.55000 | 9.484722 |
| VOE | Volotea | Ajaccio Napoleon Bonaparte Airport | LFKJ | 41.92388 | 8.802500 |
| VOE | Volotea | Tenerife South Airport | GCTS | 28.04447 | -16.572399 |
| EZY | EasyJet | Basel Mulhouse-Freiburg EuroAirport | LFSB | 47.59890 | 7.528300 |
| IBE | Iberia | Madrid Barajas Airport | LEMD | 40.49355 | -3.566760 |
| VLG | Vueling | Barcelona El Prat Airport | LEBL | 41.29707 | 2.078463 |
| BAW | British Airways | London Gatwick Airport | EGKK | 51.14805 | -0.190270 |
| AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
| AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
| VOE | Volotea | Split Airport | LDSP | 43.53894 | 16.297960 |
| EZY | EasyJet | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
| VOE | Volotea | Figari Sud-Corse Airport | LFKF | 41.50222 | 9.096667 |
| RYR | Ryanair | London Stansted Airport | EGSS | 51.88500 | 0.235000 |
| BTI | Air Baltic | Riga International Airport | EVRA | 56.92361 | 23.971109 |
| AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
| TAR | Tunisair | Tunis Carthage International Airport | DTTA | 36.85103 | 10.227210 |
| EZY | EasyJet | Venice Marco Polo Airport | LIPZ | 45.50527 | 12.351940 |
| EZY | EasyJet | London Gatwick Airport | EGKK | 51.14805 | -0.190270 |
| TAR | Tunisair | Djerba Zarzis International Airport | DTTJ | 33.87503 | 10.775460 |
| KLM | KLM | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
| TAP | TAP Portugal | Lisbon Humberto Delgado Airport | LPPT | 38.78131 | -9.135910 |
| AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
| EZY | EasyJet | Faro Airport | LPFR | 37.01442 | -7.965910 |
| BEE | Flybe | Birmingham Airport | EGBB | 52.45385 | -1.748020 |
| THY | Turkish Airlines | Istanbul Ataturk International Airport | LTBA | 40.97692 | 28.814600 |
| HOP | HOP! | Nice Cote d’Azur Airport | LFMN | 43.66527 | 7.215000 |
| EZY | EasyJet | Milan Malpensa Airport | LIMC | 45.63060 | 8.728111 |
| AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
| IBE | Iberia | Madrid Barajas Airport | LEMD | 40.49355 | -3.566760 |
| VOE | Volotea | Palma de Mallorca Airport | LEPA | 39.55167 | 2.738808 |
| EZY | EasyJet | Luxembourg Findel Airport | ELLX | 49.62333 | 6.204444 |
| VOE | Volotea | Ajaccio Napoleon Bonaparte Airport | LFKJ | 41.92388 | 8.802500 |
| EZY | EasyJet | Bristol Airport | EGGD | 51.38266 | -2.719080 |
| HOP | HOP! | Rome Leonardo da Vinci Fiumicino Airport | LIRF | 41.80447 | 12.250790 |
| EZY | EasyJet | Lisbon Humberto Delgado Airport | LPPT | 38.78131 | -9.135910 |
| VOE | Volotea | Alicante Airport | LEAL | 38.28216 | -0.558150 |
| RYR | Ryanair | Brussels South Charleroi Airport | EBCI | 50.46000 | 4.452778 |
| AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
| EZY | EasyJet | Barcelona El Prat Airport | LEBL | 41.29707 | 2.078463 |
| VOE | Volotea | Venice Marco Polo Airport | LIPZ | 45.50527 | 12.351940 |
| EIN | Aer Lingus | Dublin Airport | EIDW | 53.42138 | -6.270000 |
| HOP | HOP! | Marseille Provence Airport | LFML | 43.43666 | 5.215000 |
| KLM | KLM | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
| EZY | EasyJet | London Luton Airport | EGGW | 51.87472 | -0.368330 |
| EZY | EasyJet | Nice Cote d’Azur Airport | LFMN | 43.66527 | 7.215000 |
| HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
| EZY | EasyJet | Geneva International Airport | LSGG | 46.23806 | 6.108950 |
| DAH | Air Algerie | Algiers Houari Boumediene Airport | DAAG | 36.69101 | 3.215408 |
| AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
| HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
| EZY | EasyJet | Berlin Schonefeld Airport | EDDB | 52.38000 | 13.522500 |
| IBE | Iberia | Madrid Barajas Airport | LEMD | 40.49355 | -3.566760 |
| EZY | easyJet | Brussels Airport | EBBR | 50.90138 | 4.484444 |
| RAM | Royal Air Maroc | Casablanca Mohammed V International Airport | GMMN | 33.36746 | -7.589960 |
| AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
| VOE | Volotea | Strasbourg Airport | LFST | 48.54361 | 7.637222 |
| DLH | Lufthansa | Frankfurt Airport | EDDF | 50.02642 | 8.543125 |
| EZY | EasyJet | Geneva International Airport | LSGG | 46.23806 | 6.108950 |
| EZY | EasyJet | Lille Airport | LFQQ | 50.56333 | 3.086944 |
| HOP | HOP! | Lille Airport | LFQQ | 50.56333 | 3.086944 |
| VLG | Vueling | Barcelona El Prat Airport | LEBL | 41.29707 | 2.078463 |
| NAX | Norwegian | Oslo Gardermoen Airport | ENGM | 60.19391 | 11.100360 |
| SWR | Swiss | Zurich Airport | LSZH | 47.46472 | 8.549167 |
| VOE | Volotea | Pisa Galileo Galilei Airport | LIRP | 43.68391 | 10.392750 |
| EZY | EasyJet | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
| HOP | HOP! | Marseille Provence Airport | LFML | 43.43666 | 5.215000 |
| EZY | EasyJet | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
| AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
| AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
| HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
| EZY | EasyJet | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
| EZY | EasyJet | Basel Mulhouse-Freiburg EuroAirport | LFSB | 47.59890 | 7.528300 |
| EZY | EasyJet | London Gatwick Airport | EGKK | 51.14805 | -0.190270 |
| BEL | Brussels Airlines | Brussels Airport | EBBR | 50.90138 | 4.484444 |
| EZY | EasyJet | Geneva International Airport | LSGG | 46.23806 | 6.108950 |
| EZY | EasyJet | Nice Cote d’Azur Airport | LFMN | 43.66527 | 7.215000 |
| HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
| AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
| EZY | EasyJet | Lille Airport | LFQQ | 50.56333 | 3.086944 |

Exercises

  • Get all pages of data by generating the correct queries to the API :)

Docker !

This is the last and probably the most complex part of this big tutorial.

In a real web scraping project there are two possible use cases : a one-shot harvest, or a recurring (daily, monthly, etc.) harvest of data.

If you need to collect one year of data on a daily basis, you cannot use your personal computer. You need to connect to a remote server and run your script from there.

To keep it really short, Docker is a technology that encapsulates your software in an isolated (and, if possible, immutable) container on top of your system. The concept is similar to a virtual machine (VM), but more lightweight.

Here we use this Docker container technology to encapsulate a web scraping script. After that you can save your image and launch it on a web server.

There are three big steps to understand in the container lifecycle :

  • We describe the composition of an image in a Dockerfile, using the special Docker syntax. It is like a recipe in a cookbook. For example, you can find many recipes on Docker Hub.

  • Next, as with a recipe in real life, you need to turn the recipe into a delicious cake : the image must be built before it can be used.

  • Finally, you run the built image.

Linux way

DOCKER installation

On Ubuntu Linux, follow the official Docker installation documentation. As a first step, install the packages required to add the key and the repository.

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

Add key and repository :

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

sudo apt-get update

Install docker-ce :

sudo apt-get install docker-ce

PREPARE image

Copy the docker-images folder from the USB key (ask the teachers) into the scrap-flightradar folder of this tutorial.

Now go to this folder using the terminal (cd pathofthefolder) and load the two images on your system.

sudo docker load --input=r-alpine.tar
sudo docker load --input=rSelenium.tar

BUILD image

Go to the docker-scripts folder inside the folder that contains this tutorial on your disk.

Building this image takes a long time (about ten minutes) because of the large dplyr library. Run the docker build command in the folder that contains the Dockerfile describing the image.

docker build . --tag=rflightscraps

LAUNCH IMAGE

  1. Using a bind mount, which is currently the easiest way. Open your terminal and go to the folder of your script. Create a new folder named localbackup and run the rflightscraps container with the correct path.

mkdir localbackup
docker run --name rflightscraps -d -e UID=1000 -e GID=1000 --mount type=bind,source=$(pwd)/localbackup,destination=/usr/local/src/flight-scrap/docker-scripts/data rflightscraps

To check that your container is running and to read the execution logs :

sudo docker ps
sudo docker logs rflightscraps

To check the result of the automatic harvesting, list the docker-scripts/localbackup folder with the ls Unix command. You will see a list of CSV files, one for each harvest, made every minute. To change this frequency, modify the crontab file following the cron syntax, then rebuild and relaunch the image (this takes less time because only one file changed, so nothing needs to be recompiled).
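
Once a few harvests have accumulated, the CSV files can be stacked into one table in R. The sketch below is only an illustration : read_harvest() is a hypothetical helper, and it assumes that every CSV in the folder has the same columns, which this tutorial does not show.

```r
# Hypothetical helper: read every CSV in `dir` and stack them into one data.frame.
# Assumes all files share the same column names.
read_harvest <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, tables)   # NULL when the folder contains no CSV file
}

# Possible usage from the docker-scripts folder:
# flights <- read_harvest("localbackup")
```

Because a harvest runs every minute, consecutive files probably overlap a lot, so removing duplicates with unique() afterwards may be worthwhile.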

  2. Same thing, but using a named volume, a more portable way to share data between Docker containers; however, it lacks some permission features needed to export data easily.

Create a named volume, independent of the host filesystem :

docker volume create --name myDataVolume
docker volume ls

Mount the volume :

docker run --mount type=volume,source=myDataVolume,destination=/usr/local/src/flight-scrap/docker-scripts/data rflightscraps

Export data :

  • Using an alpine image, we mount the named volume (myDataVolume) to an /alpine_data folder inside the alpine container.

  • Then, we mount a second folder inside the alpine container, named /alpine_backup.

  • We create an archive of the contents of the /alpine_data folder and store it in the /alpine_backup folder (inside the container).

  • Because /alpine_backup is mounted from the Docker host (your local machine), the archive ends up in a folder named local_backup inside the current directory.

docker run --rm -v myDataVolume:/alpine_data -v $(pwd)/local_backup:/alpine_backup alpine:latest tar cvf /alpine_backup/scrap_data_"$(date '+%y-%m-%d')".tar /alpine_data

Windows way

There are two ways to install Docker on Windows, a new way (https://docs.docker.com/docker-for-windows/install/) and an old way. For this tutorial we use the old way because of its better compatibility.

Install Docker Toolbox for Windows using the DockerToolbox.exe file. See the official Docker Toolbox documentation for details.

After that, you can launch the Docker Quickstart Terminal directly after installation or from its icon in the Start menu.

Docker first downloads an ISO, then checks whether your system is ready to run containers. If you see an error at this point, one more step is needed.

Restart your computer and enable the BIOS option (Del key during startup) probably named “Vanderpool technology”, “VT-X technology” or “Virtualization technology”. Save and restart. Some pictures of the UEFI BIOS on Asus, Dell and HP motherboards/systems :

Asus

Dell

HP

PREPARE image

Copy the docker-scripts and docker-images folders into c:\Program Files\Docker Toolbox

After that, you will see these folders in the Docker Toolbox terminal.

Go to the docker-images folder using the cd command, and load the two images :

docker load --input=r-alpine.tar
docker load --input=rSelenium.tar

BUILD image

Go to the docker-scripts folder inside the folder that contains this tutorial on your disk.

Building this image takes a long time (about ten minutes) because of the large dplyr library. Run the docker build command in the folder that contains the Dockerfile describing the image.

docker build . --tag=rflightscraps

LAUNCH IMAGE

We use a bind mount, which is currently the easiest way.

First, create a new folder named localbackup in your user folder on Windows : C:\Users\yourname. After that, replace the path in this command with yours and run it.

docker run --name rflightscraps -d -e UID=1000 -e GID=1000 --mount type=bind,source=/c/Users/reyse/localbackup,destination=/usr/local/src/flight-scrap/docker-scripts/data rflightscraps

Finally, close the session!

remDr$close()